4  Data Structures

One of R’s most powerful features is its ability to deal with tabular data - such as you may already have in a spreadsheet or a CSV file.

Let’s start by using R to create a dataset, which we will then save in our data/ directory in a file called feline-data.csv. First, let’s create the dataset in R using the data.frame() function:

cats <- data.frame(coat = c("calico", "black", "tabby"),
                    weight = c(2.1, 5.0, 3.2),
                    likes_string = c(1, 0, 1))

Then we can save cats as a CSV file. It is good practice to call the argument names explicitly so the function knows what default values you are changing. Here we are setting row.names = FALSE. Recall you can use ?write.csv to pull up the help file to check out the argument names and their default values.

write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)

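You should now see that you have a new file, feline-data.csv, in your data/ folder, whose contents look roughly like the following (by default, write.csv() also wraps column names and text values in double quotes and writes 5.0 as 5):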

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
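To see what row.names = FALSE changes, the following sketch writes a small data frame twice and compares the header line of each file. The object cats_demo and the temporary files are made up for illustration, so nothing in your data/ folder is touched.

```r
# Illustrative data only (not the lesson's cats object)
cats_demo <- data.frame(coat = c("calico", "black"),
                        weight = c(2.1, 5.0))

# Temporary files, so nothing in data/ is overwritten
file_default <- tempfile(fileext = ".csv")
file_no_rownames <- tempfile(fileext = ".csv")

# row.names = TRUE is the default
write.csv(x = cats_demo, file = file_default)
write.csv(x = cats_demo, file = file_no_rownames, row.names = FALSE)

# Read back only the first (header) line of each file
header_default <- readLines(file_default, n = 1)
header_no_rownames <- readLines(file_no_rownames, n = 1)
header_default
header_no_rownames
```

With the default setting, the row names 1, 2, ... are written as an extra first column, so the header gains an empty leading column name.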
Tip: Editing Text files in R

Alternatively, you can create data/feline-data.csv using a text editor (Nano), or within RStudio with the File -> New File -> Text File menu item.

We can then load this .csv file into R via the following:

cats <- read.csv(file = "data/feline-data.csv")
cats
    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1
Tip: read.table() and delimiters

The read.table function reads tabular data from a text file in which the columns are separated by a delimiter character. Commas (as in .csv files; csv = comma-separated values) and tabs are the most common delimiters. For convenience, R provides two other versions of read.table: read.csv for files where the data are separated with commas, and read.delim for files where the data are separated with tabs. Of these three functions, read.csv is the most commonly used. If needed, the default delimiter can be overridden for both read.csv and read.delim.
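As a minimal sketch of how the three functions relate, the code below writes a small tab-separated file to a temporary location and reads it three ways. The file and the object names are invented for this illustration.

```r
# Illustrative tab-separated file written to a temporary location
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tsv_file)

# read.delim() expects tabs by default
cats_tab <- read.delim(file = tsv_file)

# read.table() is the general form: state the header and separator explicitly
cats_tab_general <- read.table(file = tsv_file, header = TRUE, sep = "\t")

# read.csv() can read the same file if its default separator is overridden
cats_tab_csv <- read.csv(file = tsv_file, sep = "\t")

cats_tab
```

All three calls should give the same two-column table; which one you choose is mostly a matter of which defaults match your file.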

There are lots of things we can do with our cats data object, such as extracting individual columns using the $ operator:

cats$weight
[1] 2.1 5.0 3.2
cats$coat
[1] "calico" "black"  "tabby" 

Note that each column is a vector.

We can do other operations on the columns, such as:

## Say we discovered that the scale weighs two kg light:
cats$weight + 2
[1] 4.1 7.0 5.2
paste("My cat is", cats$coat)
[1] "My cat is calico" "My cat is black"  "My cat is tabby" 

But what about:

cats$weight + cats$coat
Error in cats$weight + cats$coat: non-numeric argument to binary operator

Understanding what happened here is key to successfully analyzing data in R.

4.1 Data Types

If you guessed that the last command would return an error because 2.1 plus "calico" is nonsense, you’re right - and you already have some intuition for an important concept in programming called data types. We can ask what type or “class” of data something is:

class(cats$weight)
[1] "numeric"

You will typically encounter the following main types: numeric (which encompasses double and integer), logical, character (and factor, but we won’t encounter these until later). There are others too (such as complex), but you’re unlikely to encounter them in your data analysis journeys.

Let’s identify the class of several values:

class(3.14)
[1] "numeric"
class(TRUE)
[1] "logical"
class("banana")
[1] "character"
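The statement that numeric encompasses double and integer can be checked with typeof(), which reports the underlying storage type. This is only a small sketch; the variable names are chosen for illustration.

```r
# A plain number is stored as a double
type_of_decimal <- typeof(3.14)   # "double"

# The L suffix asks R for an integer
type_of_whole <- typeof(3L)       # "integer"
class_of_whole <- class(3L)       # "integer"

# Both storage types count as numeric
both_numeric <- c(is.numeric(3.14), is.numeric(3L))

type_of_decimal
type_of_whole
class_of_whole
both_numeric
```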

No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types. This strictness has some really important consequences.

A user has added details of another cat. This information is in the file data/feline-data_v2.csv.

file.show("data/feline-data_v2.csv")
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1

Load the new cats data like before, and check what type of data we find in the weight column:

cats_v2 <- read.csv(file="data/feline-data_v2.csv")
class(cats_v2$weight)
[1] "character"

Oh no, our weights aren’t the numeric class anymore! If we try to do the same math we did on them before, we run into trouble:

cats_v2$weight + 2
Error in cats_v2$weight + 2: non-numeric argument to binary operator

What happened?

The cats data we are working with is something called a data frame. Data frames are one of the most common and versatile types of data structures we will work with in R.

A given column in a data frame can only contain one single data type (but each column can be of a different type).

In this case, R cannot read everything in the weight column as numeric (specifically, it reads the entry 2.3 or 2.4 as character). Therefore, the data type of the entire column changes to one that is suitable for everything in the column.
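One possible way to track down which entries caused the switch to character is to attempt the conversion and look for the values that fail. The vector weight_text below is a hand-typed stand-in for the column, used only for illustration.

```r
# Illustrative copy of the weight column as text
weight_text <- c("2.1", "5.0", "3.2", "2.3 or 2.4")

# as.numeric() gives NA (with a warning) for text it cannot parse;
# the warning is silenced here because the NA is what we are looking for
weight_num <- suppressWarnings(as.numeric(weight_text))

# Positions and original values of the entries that are not plain numbers
bad_rows <- which(is.na(weight_num))
bad_rows
weight_text[bad_rows]
```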

When R reads a csv file, it reads it in as a data frame. Thus, when we loaded the cats csv file, it was stored as a data frame. We can recognize data frames by the first line that is printed by the str() function:

str(cats)
'data.frame':   3 obs. of  3 variables:
 $ coat        : chr  "calico" "black" "tabby"
 $ weight      : num  2.1 5 3.2
 $ likes_string: int  1 0 1

Data frames are composed of rows and columns, where each column has the same number of rows. Different columns in a data frame can be made up of different data types (this is what makes them so versatile), but everything in a given column needs to be the same type (e.g., numeric, character, logical, etc).

Let’s explore more about different data structures and how they behave. For now, let’s go back to working with the original feline-data.csv file while we investigate this behavior further:

feline-data.csv:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
cats <- read.csv(file = "data/feline-data.csv")
cats
    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1

4.2 Vectors and Type Coercion

To better understand this behavior, let’s learn more about the vector. A vector in R is essentially an ordered collection of values, with the special condition that everything in the vector must be the same basic data type.

A vector can be created with the c() “combine” function:

c(1, 8, 1.2)
[1] 1.0 8.0 1.2

The columns of a data frame are also vectors:

cats$weight
[1] 2.1 5.0 3.2

The fact that everything in a vector must be the same type is the root of why R forces everything in a column to be the same basic data type.

4.2.1 Coercion by combining vectors

Because all entries in a vector must have the same type, c() will coerce the type of each element to a common type. Given what we’ve learned so far, what do you think the following will produce?

quiz_vector <- c(2, 6, '3')

This is something called type coercion, and it is the source of many surprises and the reason why we need to be aware of the basic data types and how R will interpret them. When R encounters a mix of types (here numeric and character) to be combined into a single vector, it will force them all to be the same type. Consider:

coercion_vector <- c('a', TRUE)
coercion_vector
[1] "a"    "TRUE"
another_coercion_vector <- c(0, TRUE)
another_coercion_vector
[1] 0 1

4.2.2 The type hierarchy

The coercion rules go: logical -> numeric -> character, where -> can be read as “are transformed into”. For example, combining logical and character transforms the result to character:

c('a', TRUE)
[1] "a"    "TRUE"
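The hierarchy can be explored by combining different pairs of types and asking for the class of the result. This is a small sketch with made-up variable names.

```r
# logical + numeric: the logical value is promoted to numeric
logical_numeric <- c(TRUE, 2)

# numeric + character: the number is promoted to character
numeric_character <- c(2, "a")

# all three together: everything ends up as character
all_three <- c(TRUE, 2, "a")

class(logical_numeric)
class(numeric_character)
class(all_three)
all_three
```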
Tip

A quick way to recognize character vectors is by the quotes that enclose them when they are printed.

You can try to force coercion against this flow using the as. functions:

character_vector_example <- c('0', '2', '4')
character_vector_example
[1] "0" "2" "4"
character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric
[1] 0 2 4
numeric_coerced_to_logical <- as.logical(character_coerced_to_numeric)
numeric_coerced_to_logical
[1] FALSE  TRUE  TRUE

As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame; make sure everything is the same type in your vectors and your columns of data.frames, or you will get nasty surprises!
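One of the surprises is that values which cannot be converted silently become NA (missing). The sketch below uses invented example vectors to show this for as.numeric() and as.logical().

```r
# "two" is text that cannot be parsed as a number
mixed_text <- c("1", "two", "3")
as_number <- suppressWarnings(as.numeric(mixed_text))  # warning silenced on purpose
as_number

# as.logical() understands "TRUE" and "false", but not "yes"
text_flags <- c("TRUE", "false", "yes")
as_flag <- as.logical(text_flags)
as_flag

# anyNA() is a quick check for failed conversions
anyNA(as_number)
```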

But coercion can also be very useful! For example, in our cats data likes_string is numeric, but we know that the 1s and 0s actually represent TRUE and FALSE (a common way of representing them). We should use the logical datatype here, which has two states: TRUE or FALSE, which is exactly what our data represents. We can ‘coerce’ this column to be logical by using the as.logical function:

cats$likes_string
[1] 1 0 1
cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string
[1]  TRUE FALSE  TRUE
Challenge 1

An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format (e.g. numbers), your analysis is much easier! In this exercise, you will clean the cat data set from the chapter about type coercion.

4.2.3 Copy the code template

In your quarto file in RStudio, start a new code chunk and copy and paste the following code. Then move on to the tasks below, which will help you to fill in the gaps (______).

# Read data
cats <- read.csv("data/feline-data_v2.csv")
# 1. Print the data
_____

# 2. Show an overview of the table that prints out the type of each column
_____(cats)

# 3. The "weight" column has the incorrect data type __________.
#    The correct data type is: ____________.

# 4. Correct the 4th weight data point with the mean of the two given values
cats$weight[4] <- 2.35
#    print the data again to see the effect
cats

# 5. Convert the weight to the right data type
cats$weight <- ______________(cats$weight)

#    Calculate the mean to test yourself
mean(cats$weight)

# If you see the correct mean value (and not NA), you did the exercise
# correctly!

4.3 Instructions for the tasks

4.3.2 2. Overview of the data types

Use a function we saw earlier to print out the “type” of all columns of the cats table.

Tip 1.2

In the chapter “Data types” we saw two functions that can show data types. One printed just a single word, the data type name. The other printed a short form of the data type, and the first few values. We recommend the second here.

str(cats)
'data.frame':   4 obs. of  3 variables:
 $ coat        : chr  "calico" "black" "tabby" "tabby"
 $ weight      : chr  "2.1" "5.0" "3.2" "2.3 or 2.4"
 $ likes_string: int  1 0 1 1

4.3.3 3. Which data type do we need?

The shown data type is not the right one for this data (weight of a cat). Which data type do we need?

  • Why did the read.csv() function not choose the correct data type?
  • Fill in the gap in the comment with the correct data type for cat weight!
Tip 1.3

Scroll up to the section about the type hierarchy to review the available data types

Solution to Challenge 1.3
  • Weight is expressed on a continuous scale (real numbers). The R data type for this is “numeric”.
  • The fourth row has the value “2.3 or 2.4”. That is not a single number but two numbers and an English word. Therefore, the “character” data type is chosen. The whole column is now text, because all values in the same column have to be the same data type.

4.3.4 4. Correct the problematic value

The code to assign a new weight value to the problematic fourth row is given. Think first and then execute it: What will be the data type after assigning a number like in this example? You can check the data type after executing to see if you were right.

Tip 1.4

Revisit the hierarchy of data types when two different data types are combined.

The data type of the column “weight” is “character”. The assigned data type is “numeric”. Combining two data types yields the data type that is higher in the following hierarchy:

logical < numeric < character

Therefore, the column is still of type character! We need to manually convert it to “numeric”.
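The same effect can be reproduced on a plain vector. The sketch below uses a hand-typed copy of the weight column (weight_text, an illustrative name) rather than the data frame itself.

```r
# Illustrative copy of the character weight column
weight_text <- c("2.1", "5.0", "3.2", "2.3 or 2.4")

# Assigning a number into a character vector turns the number into text
weight_text[4] <- 2.35
class(weight_text)
weight_text

# Only an explicit conversion gives us numbers to calculate with
weight_fixed <- as.numeric(weight_text)
mean(weight_fixed)
```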

4.3.5 5. Convert the column “weight” to the correct data type

Cat weights are numbers. But the column does not have this data type yet. Coerce the column to floating point numbers.

Tip 1.5

The functions to convert data types start with as.. You can look for the function further up in the manuscript or use the RStudio auto-complete function: Type “as.” and then press the TAB key.

cats$weight <- as.numeric(cats$weight)

4.4 Some basic functions for creating vectors

The combine function, c(), can be used both to create a new vector and to append things to an existing vector:

ab_vector <- c('a', 'b')
ab_vector
[1] "a" "b"
combine_example <- c(ab_vector, 'z')
combine_example
[1] "a" "b" "z"

You can also make series of numbers using the : syntax as well as the seq() function:

mySeries <- 1:10
mySeries
 [1]  1  2  3  4  5  6  7  8  9 10
seq(10)
 [1]  1  2  3  4  5  6  7  8  9 10
seq(1, 10, by = 0.1)
 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
[91] 10.0
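seq() has a few more options that are handy in practice. The following sketch, with illustrative variable names, shows a fixed number of evenly spaced values, a decreasing sequence, and the helper seq_len().

```r
# Five evenly spaced values from 0 to 1
evenly_spaced <- seq(0, 1, length.out = 5)

# A negative step counts downwards
countdown <- seq(10, 1, by = -3)

# seq_len(n) gives 1, 2, ..., n
first_four <- seq_len(4)

evenly_spaced
countdown
first_four
```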

The head() and tail() functions show the first and last few entries of a vector, respectively.

sequence_example <- 20:25
head(sequence_example, n = 2)
[1] 20 21
tail(sequence_example, n = 4)
[1] 22 23 24 25

The length() function computes the number of entries in the vector:

length(sequence_example)
[1] 6

And the class() function reports the class/type of the values in the vector:

class(sequence_example)
[1] "integer"

We can extract individual elements of a vector by using the square bracket notation:

first_element <- sequence_example[1]
first_element
[1] 20

To change a single element, use the bracket on the other side of the arrow:

sequence_example[1] <- 30
sequence_example
[1] 30 21 22 23 24 25
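Type coercion also applies when a single element is replaced. In the sketch below (variable names are illustrative), assigning the double 30 to an integer vector changes its class, while assigning the integer 30L does not.

```r
int_vector <- 20:25
class_before <- class(int_vector)        # "integer"

# 30 without the L suffix is a double, so the whole vector is promoted
int_vector[1] <- 30
class_after_double <- class(int_vector)  # "numeric"

# Assigning an integer keeps the vector an integer vector
int_vector_kept <- 20:25
int_vector_kept[1] <- 30L
class_after_integer <- class(int_vector_kept)

class_before
class_after_double
class_after_integer
```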
Challenge 2

Start by making a vector with the numbers 5 through 26. Then:

  • Print out the first three entries of the vector

  • Extract the fourth entry of the vector

  • Multiply the vector by 2.

x <- 5:26
head(x, 3)
[1] 5 6 7
x[4]
[1] 8
x <- x * 2
x
 [1] 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52

5 Names

With names, we can give meaning to elements. This is the first time that we have not only the data but also information that explains it: metadata that can be attached to the object like a label. In R, this is called an attribute. Some attributes enable us to do more with our object, for example, as here, accessing an element by a self-defined name.

5.1 Accessing vectors by name

Each element of a vector can be given a name:

pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

To retrieve a specific named entry from a vector, we can use the square bracket notation:

pizza_price["pizzasubito"]
pizzasubito 
       5.64 

which is equivalent to extracting the first entry of the vector:

pizza_price[1]
pizzasubito 
       5.64 
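Named indexing also works with several names at once, and double brackets return the bare value without its name. A short sketch follows; the name "nosuchpizza" is deliberately one that does not exist in the vector.

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# Several entries, in the order requested
two_prices <- pizza_price[c("callapizza", "pizzafresh")]
two_prices

# Double brackets return the value without the name
fresh_price <- pizza_price[["pizzafresh"]]
fresh_price

# Asking for a name that is not present gives NA rather than an error
missing_price <- pizza_price["nosuchpizza"]
missing_price
```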

5.2 Accessing and changing names

If you want to extract just the names of an object, use the names() function:

names(pizza_price)
[1] "pizzasubito" "pizzafresh"  "callapizza" 
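Since names are stored as an attribute, they can also be inspected with the general attribute functions. The sketch below recreates the vector so that it runs on its own.

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# All attributes of the object: here only the names
all_attributes <- attributes(pizza_price)
all_attributes

# names() is a convenient shortcut for the "names" attribute
same_names <- identical(attr(pizza_price, "names"), names(pizza_price))
same_names

# unname() returns a copy with the names removed
plain_prices <- unname(pizza_price)
names(plain_prices)
```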

We have seen how to access and change single elements of a vector. The same is possible for names:

names(pizza_price)[3]
[1] "callapizza"
names(pizza_price)[3] <- "call-a-pizza"
pizza_price
 pizzasubito   pizzafresh call-a-pizza 
        5.64         6.60         4.50 
Challenge 3

What is the data type of the names of pizza_price? You can find out using the str() or class() functions.

You get the names of an object by wrapping the object name inside names(...). Similarly, you get the data type of the names by again wrapping the whole code in class(...):

class(names(pizza_price))
[1] "character"

Alternatively, use a new variable if this is easier for you to read (a name other than names avoids confusion with the names() function):

pizza_names <- names(pizza_price)
class(pizza_names)
[1] "character"
Challenge 4

Instead of just changing the names of each element of a vector individually, you can also set all names of an object by writing code like (replace ALL CAPS text):

names( OBJECT ) <-  CHARACTER_VECTOR

Create a vector that gives the number for each letter in the alphabet!

  1. Generate a vector called letter_no with the sequence of numbers from 1 to 26
  2. R has a built-in object called LETTERS (type LETTERS in the console). It is a character vector of the 26 uppercase letters from A to Z. Set the names of letter_no to these 26 letters
  3. Test yourself by calling letter_no["B"], which should give you the number 2!
letter_no <- 1:26   # or seq(1,26)
names(letter_no) <- LETTERS
letter_no["B"]
B 
2 

6 Data frames

We introduced data frames at the very beginning of this lesson; they represent a table of data. Recall our cats data frame:

cats
    coat weight likes_string
1 calico    2.1         TRUE
2  black    5.0        FALSE
3  tabby    3.2         TRUE

Columns of a data frame are vectors that may have different types but must have the same length; they are grouped together because they belong to the same table.

In our cats example, we have a character, a numeric, and a logical column/variable. As we have seen already, each column of a data frame is a vector.

6.0.1 Extracting information from a data frame

There are several ways to extract an individual column in a data frame, including using the $ notation that we used above:

cats$coat
[1] "calico" "black"  "tabby" 

But a column can also be accessed using the square bracket notation:

cats[, 1]
[1] "calico" "black"  "tabby" 

which returns the column as a vector.

The syntax df[i, j] extracts the ith row and the jth column from the data frame called df.

A blank i or j tells R to extract all of the rows or columns, so df[, 1] will extract all rows for the 1st column. df[2, ], however, will extract the second row across all columns. df[3, 1] will extract the single entry in the third row and first column.

6.0.2 The square bracket syntax [ ]

  • df[, j] will extract the jth column from the data frame called df as a vector.

  • df[i, ] will extract the ith row from the data frame called df as a data frame.

For example the following code extracts the data from the second column of cats as a vector

cats[, 2]
[1] 2.1 5.0 3.2

and the following code extracts the second row of cats as a data frame:

cats[2, ]
   coat weight likes_string
2 black      5        FALSE
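The i and j positions also accept vectors of indices or names, and the drop argument controls whether a single column is simplified to a vector. The following is a sketch on a stand-alone copy of the data (cats_demo, an illustrative name).

```r
# Stand-alone copy of the lesson's data, for illustration
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(TRUE, FALSE, TRUE))

# Rows 1 and 3, columns selected by name
subset_rows_cols <- cats_demo[c(1, 3), c("coat", "weight")]
subset_rows_cols

# A single column is simplified to a vector by default ...
weight_vector <- cats_demo[, 2]
class(weight_vector)

# ... unless drop = FALSE asks R to keep the data frame structure
weight_frame <- cats_demo[, 2, drop = FALSE]
class(weight_frame)
```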

Note, to extract the jth column from a data frame as a single-column data frame, you can use the single-dimension square bracket syntax: df[j].

cats[2]
  weight
1    2.1
2    5.0
3    3.2

This syntax also works with named indexing.

cats["weight"]
  weight
1    2.1
2    5.0
3    3.2

We will explain a bit more about why this works momentarily when we introduce lists.

Challenge 5

There are several subtly different ways to call variables, observations and elements from data.frames:

  • cats[1]
  • cats$coat
  • cats["coat"]
  • cats[1, 1]
  • cats[, 1]
  • cats[1, ]

Try out these examples and explain what is returned by each one.

Hint: Use the function class() to examine what is returned in each case.

cats[1]
    coat
1 calico
2  black
3  tabby

We can think of a data frame as a list of vectors. The single bracket [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame, returned as a one-column data frame.

cats$coat
[1] "calico" "black"  "tabby" 

This example uses the $ character to address items by name. coat is the first column of the data frame, again a vector of type character.

cats["coat"]
    coat
1 calico
2  black
3  tabby

Here we are using a single bracket ["coat"], replacing the index number with the column name. Like example 1, the returned object is a list (a one-column data frame).

cats[1, 1]
[1] "calico"

This example uses a single bracket, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a vector of type character.

cats[, 1]
[1] "calico" "black"  "tabby" 

Like the previous example, we use single brackets and provide row and column coordinates. The row coordinate is not specified, so R interprets this missing value as all the elements in this column and returns them as a vector.

cats[1, ]
    coat weight likes_string
1 calico    2.1         TRUE

Again we use the single bracket with row and column coordinates. The column coordinate is not specified. The return value is a data frame (and therefore a list) containing all the values in the first row.
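Following the hint, the class of each of the six expressions can be collected in one go. The sketch below works on a stand-alone copy of the data called cats_demo (an illustrative name) and assumes character columns are not converted to factors, which is the default from R 4.0 onwards.

```r
# Stand-alone copy of the lesson's data, for illustration
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(TRUE, FALSE, TRUE))

# Class of the object returned by each way of indexing
returned_classes <- c(
  "cats[1]"      = class(cats_demo[1]),
  "cats$coat"    = class(cats_demo$coat),
  "cats['coat']" = class(cats_demo["coat"]),
  "cats[1, 1]"   = class(cats_demo[1, 1]),
  "cats[, 1]"    = class(cats_demo[, 1]),
  "cats[1, ]"    = class(cats_demo[1, ])
)
returned_classes
```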

Tip: Renaming data frame columns

Like vectors, data frames have column names, which can be accessed with the names() function.

names(cats)
[1] "coat"         "weight"       "likes_string"

If you want to rename the second column of cats, you can assign a new name to the second element of names(cats).

names(cats)[2] <- "weight_kg"
cats
    coat weight_kg likes_string
1 calico       2.1         TRUE
2  black       5.0        FALSE
3  tabby       3.2         TRUE

6.1 Lists

A data frame is technically a special case of a list.

Lists are very flexible because you can put anything you want in them: unlike a vector, the elements of a list can have different data types. For example:

list_example <- list(1, "a", TRUE)
list_example
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

Like a vector, the “length” of a list corresponds to how many entries it contains:

length(list_example)
[1] 3

When printing the object structure with str(), we see the data types of all elements:

str(list_example)
List of 3
 $ : num 1
 $ : chr "a"
 $ : logi TRUE

To retrieve one of the elements of a list, we use the double bracket notation:

list_example[[2]]
[1] "a"
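The difference between single and double brackets is easy to miss, so here is a short sketch on the same kind of list (variable names are illustrative): single brackets return a smaller list, double brackets return the element itself.

```r
list_example <- list(1, "a", TRUE)

# Single brackets: a list of length 1 that contains "a"
single_bracket <- list_example[2]
class(single_bracket)

# Double brackets: the element "a" itself
double_bracket <- list_example[[2]]
class(double_bracket)

# Single brackets can select several entries at once
first_two <- list_example[1:2]
length(first_two)
```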

The elements of a list can also have names, which are given by prepending them to the values, separated by an equals sign:

another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)
another_list
$title
[1] "Numbers"

$numbers
 [1]  1  2  3  4  5  6  7  8  9 10

$data
[1] TRUE

This results in a named list. Naming the elements gives us an additional way to access single elements, the $ operator:

another_list$title
[1] "Numbers"

as well as using named indexing in the double square bracket notation.

another_list[["title"]]
[1] "Numbers"

Lists, it turns out, can become a lot more complicated than vectors. While each entry of a vector is just a single value, each entry of a list can be any type of object, including vectors and data frames. For example, the following list of length three contains three entries: a numeric vector, a data frame, and a single character value:

complicated_list <- list(vec = c(1, 2, 9),
                         dataframe = cats, 
                         single_value = "a")
complicated_list
$vec
[1] 1 2 9

$dataframe
    coat weight likes_string
1 calico    2.1         TRUE
2  black    5.0        FALSE
3  tabby    3.2         TRUE

$single_value
[1] "a"
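Entries of such a nested list can be reached by chaining the indexing operators. The sketch below builds a similar list from scratch (complicated_demo and its small data frame are illustrative stand-ins) and extracts values from each level.

```r
# Illustrative list with the same three kinds of entries
complicated_demo <- list(vec = c(1, 2, 9),
                         dataframe = data.frame(coat = c("calico", "black", "tabby"),
                                                weight = c(2.1, 5.0, 3.2)),
                         single_value = "a")

# Third value of the numeric vector
third_value <- complicated_demo$vec[3]
third_value

# A column of the data frame stored inside the list
coat_column <- complicated_demo[["dataframe"]]$coat
coat_column

# A single cell of that data frame, by row number and column name
black_weight <- complicated_demo$dataframe[2, "weight"]
black_weight
```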
Challenge 6

Create a list of length two containing a character vector containing the letters “x”, “y”, “z” and a data frame with two columns that looks like this.

    name grade
1  Henry     A
2 Hannah     B
3 Harvey     C

Your list output should look like this:

[[1]]
[1] "x" "y" "z"

[[2]]
    name grade
1  Henry     A
2 Hannah     B
3 Harvey     C
list(c("x", "y", "z"),
     data.frame(name = c("Henry", "Hannah", "Harvey"), grade = c("A", "B", "C")))
[[1]]
[1] "x" "y" "z"

[[2]]
    name grade
1  Henry     A
2 Hannah     B
3 Harvey     C

6.1.1 Data frames as a special case of a list

It turns out that a data frame is a special kind of a list. Specifically, a data frame is a list of vectors of the same length.

This is why you can extract vector columns from a data frame using the double brackets notation:

cats
    coat weight likes_string
1 calico    2.1         TRUE
2  black    5.0        FALSE
3  tabby    3.2         TRUE
cats[["coat"]]
[1] "calico" "black"  "tabby" 

Note that the df[i, j] index notation is specific to data frames (and does not work for lists).
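Because a data frame is a list of columns, functions that work on lists can be applied to it as well. The following sketch uses a stand-alone copy of the data (cats_demo, an illustrative name) to show a few of them.

```r
# Stand-alone copy of the lesson's data, for illustration
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(TRUE, FALSE, TRUE))

# A data frame passes the test for being a list
frame_is_list <- is.list(cats_demo)
frame_is_list

# Its length is the number of columns, not the number of rows
n_columns <- length(cats_demo)
n_columns

# sapply() visits each list entry, i.e. each column
column_classes <- sapply(cats_demo, class)
column_classes

# as.list() drops the table structure and keeps the named vectors
str(as.list(cats_demo))
```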

We will learn more about extracting information from vectors, lists and data frames shortly.